Abstract

Images can vary with changes in viewpoint, resolution, noise, and illumination. In this paper,
we aim to learn image representations that are robust to wide changes in such environmental
conditions, using training pairs of matching and non-matching local image patches collected
under various environmental conditions. We present a regularized discriminant analysis that
emphasizes two challenging categories among the given training pairs: (1) matching but
far-apart pairs and (2) non-matching but close pairs in the original feature space (e.g., the
SIFT feature space). Compared to existing work on metric learning and discriminant analysis,
our method can better distinguish relevant images from irrelevant but look-alike images.

1 Introduction 

In many computer vision problems, images are compared using their local descriptors. A local
descriptor is a feature vector representing the characteristics of an interesting local part of an
image. The scale-invariant feature transform (SIFT) is widely used for extracting interesting parts
and their local descriptors from an image. Two images are then compared by pairing each local
descriptor in one image with its closest local descriptor in the other image and aggregating the
pairs whose distances are below some threshold. The assumption behind this procedure is that local
descriptors corresponding to the same local part ("matching descriptors") are usually close enough
in the feature space, whereas local descriptors belonging to different local parts ("non-matching
descriptors") are far apart.
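The comparison procedure described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the paper: the function name `match_descriptors` and the parameter `max_dist` are hypothetical, and plain Euclidean distance on descriptor arrays is assumed.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, max_dist):
    """Pair each descriptor in desc_a with its nearest descriptor in desc_b,
    keeping only pairs whose Euclidean distance is below max_dist."""
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(desc_a), len(desc_b)).
    diff = desc_a[:, None, :] - desc_b[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Index and distance of the closest descriptor in desc_b for each row.
    nearest = dist.argmin(axis=1)
    nearest_dist = dist[np.arange(len(desc_a)), nearest]
    # Keep only pairs under the (assumed) distance threshold.
    keep = np.flatnonzero(nearest_dist < max_dist)
    return [(int(i), int(nearest[i]), float(nearest_dist[i])) for i in keep]
```

The number of returned pairs can then serve as a simple similarity score between two images. Under the conditions discussed next, many of these pairs may be non-matching.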

However, this assumption does not hold when environmental conditions (e.g., viewpoint,
illumination, noise, and resolution) change significantly between two images. For the same local
part, varying environmental conditions can yield varying local image patches, leaving matching
descriptors far apart in the feature space. Conversely, image patches of different local parts
can look similar to each other under some environmental conditions, leaving non-matching
descriptors close together. Fig. 1 shows some examples: in each triplet, the first two image patches
belong to the same local part but were captured under different environmental conditions, while the
third patch belongs to a different part but looks similar to the second one. As a result, the SIFT
descriptors of non-matching local parts are closer than those of matching parts. Consequently,
two images cannot be compared correctly using their local descriptors when there are significant
differences in environmental conditions between them. Fig. 2(a) shows such cases.

In this paper, we address this problem by learning more robust representations for local image
patches, in which matching parts are more similar to each other than non-matching parts, even
under widely varying environmental conditions.

*The full version of this manuscript is currently under review in an international journal. 
Distances annotated on the triplets of Figure 1 (matching pair vs. non-matching pair):

SIFT: 290 > 221    LDE: 305 > 278    Ours: 349 < 365
SIFT: 304 > 231    LDE: 336 > 275    Ours: 360 < 425
SIFT: 336 > 246    LDE: 371 > 314    Ours: 362 < 372
SIFT: 268 > 232    LDE: 283 < 291    Ours: 301 < 335
SIFT: 213 > 199    LDE: 268 ≈ 264    Ours: 295 < 388
SIFT: 267 > 257    LDE: 240 < 319    Ours: 257 < 405

Figure 1: Some examples where a local part (center in each triplet) is closer to a non-matching part
(right) than to a matching part (left) in terms of the Euclidean distance between their SIFT
descriptors. Using linear discriminant embedding (LDE) [3], non-matching pairs are still closer than
matching pairs in the first three examples. Compared to existing work on metric learning and
discriminant analysis, our learning method focuses more on "far but matching" and "close but
non-matching" training pairs, so that it can successfully distinguish look-alike irrelevant parts.

(a) 15 closest SIFT pairs    (b) 15 closest RDE pairs

Figure 2: (a) When two images of the same scene are captured under considerably different
conditions, many irrelevant pairs of local parts are chosen as closest pairs in the local feature
space, which may lead to undesirable comparison results. (b) In our RDE space, matching pairs are
successfully chosen as closest pairs.



2 Proposed Method 

In descriptor learning, a projection is obtained from training pairs of matching and non-matching
descriptors in order to map given local descriptors (e.g., SIFT) to a new feature space
where matching descriptors are closer to each other and non-matching descriptors are farther from
each other. Traditional techniques for supervised dimensionality reduction, including linear
discriminant analysis (LDA) and local Fisher discriminant analysis (LFDA) [4], can be applied to
descriptor learning after a slight modification. For example, linear discriminant embedding (LDE) [3]
is derived from LDA with a simple modification for handling pairwise training data.

We propose a regularized learning framework in order to further emphasize (1) matching but
far-apart pairs and (2) non-matching but look-alike pairs under a wide range of environmental
conditions. First, we divide the given training pairs of local descriptors into four subsets:
Relevant-Near (Rel-Near), Relevant-Far (Rel-Far), Irrelevant-Near (Irr-Near), and Irrelevant-Far
(Irr-Far). For example, the "Irr-Near" subset consists of irrelevant (i.e., non-matching) but near
pairs. We define an irrelevant pair $(x_i, x_j)$ as "near" if $x_i$ is one of the $k$ nearest
descriptors* among all non-matching descriptors of $x_j$, or vice versa. Similarly, a relevant pair
$(x_i, x_j)$ is called "near" if $x_i$ is one of the $k$ nearest descriptors among all matching
descriptors of $x_j$. All the other pairs belong to "Irr-Far" or "Rel-Far".
Then we seek a linear projection $T$ that maximizes the following regularized ratio:

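$$J(T) = \frac{\beta_{IN} \sum_{(i,j) \in \mathcal{P}_{IN}} d_{ij}(T) + \beta_{IF} \sum_{(i,j) \in \mathcal{P}_{IF}} d_{ij}(T)}{\beta_{RN} \sum_{(i,j) \in \mathcal{P}_{RN}} d_{ij}(T) + \beta_{RF} \sum_{(i,j) \in \mathcal{P}_{RF}} d_{ij}(T)} \qquad (1)$$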



*In our experiments, setting 1 < k < 10 achieved a reasonable performance improvement. 
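The four-way split of training pairs described above can be sketched as follows. This is a brute-force illustration under stated assumptions, not the authors' implementation: every descriptor is assumed to carry a hypothetical `part_id` label (descriptors sharing a label are matching), all pairs among the given descriptors are treated as training pairs, and the "or vice versa" rule is assumed to apply to relevant pairs as well. The function name `split_pairs` and the subset keys are illustrative.

```python
import numpy as np

def split_pairs(X, part_id, k):
    """Split all descriptor pairs into Rel-Near (RN), Rel-Far (RF),
    Irr-Near (IN), and Irr-Far (IF) using k-nearest-neighbor tests."""
    X = np.asarray(X, dtype=float)
    part_id = np.asarray(part_id)
    n = len(X)
    # Squared Euclidean distances between all descriptors.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # same[i, j] is True when descriptors i and j are matching (assumed label).
    same = part_id[:, None] == part_id[None, :]
    near = np.zeros((n, n), dtype=bool)
    for j in range(n):
        # Search separately among matching and non-matching descriptors of j.
        for mask in (same[j], ~same[j]):
            cand = np.flatnonzero(mask)
            cand = cand[cand != j]
            nearest = cand[np.argsort(sq[j, cand], kind="stable")[:k]]
            near[j, nearest] = True
    # "Or vice versa": a pair is near if either side lists the other.
    near |= near.T
    subsets = {"RN": [], "RF": [], "IN": [], "IF": []}
    for i, j in zip(*np.triu_indices(n, k=1)):
        key = ("R" if same[i, j] else "I") + ("N" if near[i, j] else "F")
        subsets[key].append((int(i), int(j)))
    return subsets
```

The quadratic memory cost makes this suitable only for small examples. A nearest-neighbor index would be needed at the scale of real training sets.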
(a) (b)
Figure 3: Toy examples of projections learned by LDE, LFDA, and our RDE. 
(a) SIFT feature space    (b) LDE feature space    (c) Our RDE feature space

Figure 4: Distribution of Euclidean distances in a given feature space for each subset of pairs. Err
(Rel vs Irr) measures the proportion of overlap between {Rel-Near, Rel-Far} and {Irr-Near,
Irr-Far}, while Err (RFar vs INear) measures the overlap between Rel-Far and Irr-Near. In
our RDE space, non-matching pairs are well distinguished from matching pairs.



In Eq. (1), $d_{ij}(T)$ denotes the squared distance $\|T(x_i - x_j)\|^2$ between two local
descriptors $x_i$ and $x_j$ in the projected space, and $\mathcal{P}_{RN}$, $\mathcal{P}_{RF}$,
$\mathcal{P}_{IN}$, $\mathcal{P}_{IF}$ denote the subsets of Rel-Near, Rel-Far, Irr-Near,
and Irr-Far pairs, respectively. Four regularization constants $\beta_{RN}$, $\beta_{RF}$,
$\beta_{IN}$, $\beta_{IF}$ control the importance of each subset.

• In LDE, all pairs are equally important, i.e., $\beta_{RN} = \beta_{RF} = \beta_{IN} = \beta_{IF} = 1$.

• In LFDA, "near" pairs are more important, i.e., $\beta_{RN} \gg \beta_{RF}$ and $\beta_{IN} \gg \beta_{IF}$.

• In our method, we propose to emphasize Rel-Far (matching but far apart) and Irr-Near
(non-matching but close) pairs, i.e., $\beta_{RN} \ll \beta_{RF}$ and $\beta_{IN} \gg \beta_{IF}$.
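One way to evaluate the regularized ratio under the three weightings listed above is sketched here. Because $\|T(x_i - x_j)\|^2 = \mathrm{tr}(T (x_i - x_j)(x_i - x_j)^\top T^\top)$, each weighted sum of squared distances is a trace over a scatter matrix of pair differences. The sketch assumes the ratio places the irrelevant-pair terms in the numerator and the relevant-pair terms in the denominator. Only the orderings of the weights come from the text; the values 1 and 10, the helper names, and the `ridge` term are assumptions. For a single projection direction the ratio is a generalized Rayleigh quotient, so `best_direction` returns its maximizer via the top generalized eigenvector. How a multi-dimensional $T$ is optimized is not stated in this section.

```python
import numpy as np

# Illustrative weights: only the orderings follow the text, the values are assumed.
PRESETS = {
    "LDE":  {"RN": 1.0,  "RF": 1.0,  "IN": 1.0,  "IF": 1.0},
    "LFDA": {"RN": 10.0, "RF": 1.0,  "IN": 10.0, "IF": 1.0},
    "RDE":  {"RN": 1.0,  "RF": 10.0, "IN": 10.0, "IF": 1.0},
}

def scatter(X, pairs):
    """Sum of (x_i - x_j)(x_i - x_j)^T over the given index pairs."""
    d = X.shape[1]
    if len(pairs) == 0:
        return np.zeros((d, d))
    idx = np.asarray(pairs)
    diff = X[idx[:, 0]] - X[idx[:, 1]]
    return diff.T @ diff

def weighted_scatters(X, subsets, beta):
    """Weighted scatter matrices of irrelevant and relevant pairs."""
    irr = beta["IN"] * scatter(X, subsets["IN"]) + beta["IF"] * scatter(X, subsets["IF"])
    rel = beta["RN"] * scatter(X, subsets["RN"]) + beta["RF"] * scatter(X, subsets["RF"])
    return irr, rel

def ratio(T, irr, rel):
    """Regularized ratio tr(T irr T^T) / tr(T rel T^T)."""
    return np.trace(T @ irr @ T.T) / np.trace(T @ rel @ T.T)

def best_direction(irr, rel, ridge=1e-10):
    """Single-row projection maximizing the ratio (top generalized eigenvector)."""
    d = rel.shape[0]
    # Whiten with the Cholesky factor of the (slightly regularized) denominator.
    L = np.linalg.cholesky(rel + ridge * np.eye(d))
    L_inv = np.linalg.inv(L)
    vals, vecs = np.linalg.eigh(L_inv @ irr @ L_inv.T)
    # eigh sorts eigenvalues in ascending order, so the last column is the top one.
    w = L_inv.T @ vecs[:, -1]
    return (w / np.linalg.norm(w))[None, :]
```

Swapping `PRESETS["LDE"]` for `PRESETS["RDE"]` changes only the scatter weights, so the same solver can be used to compare the two weightings on the same training pairs.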

Fig. 3 shows when and why our method can better distinguish Irr-Near pairs from Rel-Far pairs.
In Fig. 3(a), the global intra-class distribution forms a diagonal, while each local cluster has no
meaningful direction of scatter. Since LFDA focuses on "near" pairs, it cannot capture the true
intra-class scatter well, leading to an undesirable projection. In Fig. 3(b), LDE obtains a projection
that maximizes the inter-class variance but does not take the shape of the class boundary into
account, leading to an overlap between the two classes. In this case, focusing more on Irr-Near
pairs (i.e., pairs from opposite clusters near the class boundary) can preserve the separability of
the classes.

Fig. 4 shows the distance distributions of local descriptors, where 20,000 pairs of each subset are
randomly chosen from 500,000 local patches of Flickr images. As shown in Fig. 4(a), Rel-Near
and Irr-Far pairs are already well separated in the SIFT space, but Rel-Far and Irr-Near pairs are
not distinguished well (~30% overlap), and many Rel-Far pairs lie farther apart than Irr-Near pairs.
Learning by LDE achieves only a marginal improvement (Fig. 4(b)). By contrast, our RDE
achieves a significant improvement in the separability between matching and non-matching pairs,
especially for the two challenging subsets, Rel-Far and Irr-Near (Fig. 4(c)). Figs. 1 and 2 also
show the advantage of our method over existing work.
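The exact definition of the overlap error is not given in this section. One plausible reading, assumed here purely for illustration, is the histogram intersection of two normalized distance distributions, which is 0 for fully separated subsets and 1 for identical ones. The function name `histogram_overlap` and the bin count are hypothetical.

```python
import numpy as np

def histogram_overlap(dist_a, dist_b, bins=50):
    """Histogram intersection of two sets of pairwise distances (assumed metric)."""
    dist_a = np.asarray(dist_a, dtype=float)
    dist_b = np.asarray(dist_b, dtype=float)
    # Shared bin edges so that both histograms are directly comparable.
    lo = min(dist_a.min(), dist_b.min())
    hi = max(dist_a.max(), dist_b.max())
    edges = np.linspace(lo, hi, bins + 1)
    pa, _ = np.histogram(dist_a, bins=edges)
    pb, _ = np.histogram(dist_b, bins=edges)
    # Normalize counts to proportions, then sum the bin-wise minimum.
    pa = pa / pa.sum()
    pb = pb / pb.sum()
    return float(np.minimum(pa, pb).sum())
```

Applied to the Rel-Far and Irr-Near distances before and after projection, a measure of this kind would quantify how much a learned space reduces the overlap between the two challenging subsets.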